arXiv:1507.07089v1 [math.ST] 25 Jul 2015


PROPER SCORING AND SUFFICIENCY 


Peter Harremoes 

Niels Brock, Copenhagen Business College, 
Copenhagen, DENMARK, harremoes@ieee.org 


ABSTRACT 

Logarithmic score and information divergence appear in
information theory, statistics, statistical mechanics,
and portfolio theory. We demonstrate that all these topics
involve some kind of optimization that leads directly
to the use of Bregman divergences. If a sufficiency condition
is also fulfilled, the Bregman divergence must be
proportional to information divergence. The sufficiency
condition has quite different consequences in the different
areas of application, and often it is not fulfilled. Therefore
the sufficiency condition can be used to explain when results
can be transferred directly from one area to another
and when one will experience differences.

1. INTRODUCTION 

The use of scoring rules has a long history in statistics. An
early contribution was the idea of minimizing the sum of
squared deviations, which dates back to Gauss and works
perfectly for Gaussian distributions. In the 1920s Ramsey
and de Finetti proved versions of the Dutch book theorem,
where the determination of probability distributions was
considered as a dual problem to maximizing a payoff function.
Later it was proved that any consistent inference corresponds
to optimizing with respect to some payoff function.
A more systematic study of scoring rules was given
by McCarthy [1], and scoring rules have recently been studied
by Dawid, Lauritzen and Parry [?], who extended the notion
of a local scoring rule. The basic result is that the only
strictly local proper scoring rule is logarithmic score.

Thermodynamics is the study of concepts like heat,
temperature and energy. A major objective is to extract as
much energy from a system as possible. Concepts like entropy
and free energy play a significant role. The idea in
statistical mechanics is to view the macroscopic behavior
of a thermodynamic system as a statistical consequence
of the interaction between a lot of microscopic components,
where the interactions between the components are
governed by very simple laws. Here the central limit theorem
and large deviation theory play a major role. One
of the main achievements is the formula for entropy as a
logarithm of a probability.

One of the main purposes of information theory is to
compress data so that data can be recovered exactly or
approximately. One of the most important quantities was
called entropy because it is calculated according to a formula
that mimics the calculation of entropy in statistical
mechanics. Another key concept in information theory
is information divergence (KL-divergence), which was introduced
by Kullback and Leibler in 1951 in a paper entitled
"On information and sufficiency". The link from information
theory back to statistical physics was developed by E.T.
Jaynes via the maximum entropy principle. The link back
to statistics is now well established [?].

The relation between information theory and gambling
was established by Kelly [5]. Logarithmic terms appear
because we are interested in the exponent in an exponential
growth rate of our wealth. Later Kelly's approach
has been generalized to trading of stocks, although the
relation to information theory is weaker [7].

Related quantities appear in statistics, statistical mechanics,
information theory and finance, and we are interested
in a theory that describes when these relations are
exact and when they just work by analogy. First we introduce
some general results about optimization on convex
sets. This part applies exactly to all the topics under
consideration and leads to Bregman divergences. Secondly, we
introduce a notion of sufficiency and show that this leads
to information divergence and logarithmic score. This
second step is not always applicable, which explains when
the different topics are really different.

Proofs of the theorems in this short paper can be found
in an appendix that is part of the arXiv version of the paper.

2. STATE SPACE 

The present notion of a state space is based on [?] and is
mainly relevant for quantum systems.

Before we do anything we prepare our system. Let $\mathcal{P}$
denote the set of preparations. Let $p_0$ and $p_1$ denote two
preparations. For $t \in [0,1]$ we define $(1-t) \cdot p_0 + t \cdot p_1$ as
the preparation obtained by preparing $p_0$ with probability
$1-t$ and $p_1$ with probability $t$. A measurement $m$ is defined
as an affine mapping of the set of preparations into a set
of probability measures on some measurable space. Let
$\mathcal{M}$ denote a set of feasible measurements. The state space
$\mathcal{S}$ is defined as the set of preparations modulo measurements.
Thus, if $p_1$ and $p_2$ are preparations then they represent
the same state if $m(p_1) = m(p_2)$ for any $m \in \mathcal{M}$.

In statistics the state space equals the set of preparations
and has the shape of a simplex. The symmetry group
of a simplex is simply the group of permutations of the
extreme points. In quantum theory the state space has the
shape of the density matrices on a complex Hilbert space,
and the state space has a lot of symmetries that a simplex
does not have. For simplicity we will assume that the state
space is a finite dimensional convex compact space.

3. OPTIMIZATION 

Let $\mathcal{A}$ denote a subset of the feasible measurements $\mathcal{M}$
such that $a \in \mathcal{A}$ maps $\mathcal{S}$ into a distribution on the real
numbers, i.e. a random variable. The elements of $\mathcal{A}$ may
represent actions like the score of a statistical decision,
the energy extracted by a certain interaction with the system,
(minus) the length of a codeword of the next encoded
input letter using a specific code book, or the revenue
of using a certain portfolio. For each $s \in \mathcal{S}$ we define
$F(s) = \sup_{a \in \mathcal{A}} E[a(s)]$. We note that $F$ is convex but
$F$ need not be strictly convex. We say that a sequence of
actions $(a_n)_n$ is asymptotically optimal for the state $s$ if
$E[a_n(s)] \to F(s)$ for $n \to \infty$.

If the state is $s_1$ but one acts as if the state were $s_2$, one
suffers a regret that equals the difference between what
one achieves and what could have been achieved.

Definition 1. If $F(s_1)$ is finite the regret is defined by

$$D_F(s_1, s_2) = F(s_1) - \sup_{(a_n)_n} \limsup_{n \to \infty} E[a_n(s_1)] \tag{1}$$

where the supremum is taken over all sequences $(a_n)_n$
that are asymptotically optimal for $s_2$.

Proposition 2. The regret $D_F$ has the following properties:

• $D_F(s_1, s_2) \geq 0$ with equality if $s_1 = s_2$.

• $\sum_i t_i \cdot D_F(s_i, \tilde{s}) \geq \sum_i t_i \cdot D_F(s_i, \bar{s}) + D_F(\bar{s}, \tilde{s})$
where $(t_1, t_2, \dots, t_\ell)$ is a probability vector and
$\bar{s} = \sum_i t_i \cdot s_i$.

• $\sum_i t_i \cdot D_F(s_i, \tilde{s})$ is minimal when $\tilde{s} = \sum_i t_i \cdot s_i$.
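The second property of Proposition 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes information divergence on a three-letter alphabet as the regret (for a Bregman divergence the inequality holds with equality), and the states, the weights, and the helper names `kl` and `mixture` are hypothetical choices that do not appear in the paper.

```python
import math

def kl(p, q):
    """Information divergence D(p||q) in nats for finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture(weights, states):
    """Convex combination sum_i t_i * s_i of probability vectors."""
    dim = len(states[0])
    return [sum(t * s[j] for t, s in zip(weights, states)) for j in range(dim)]

# Hypothetical example data: three states on a three-letter alphabet.
states = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
weights = [0.5, 0.3, 0.2]
s_tilde = [0.2, 0.5, 0.3]          # the state one acts as if were true
s_bar = mixture(weights, states)   # the average state sum_i t_i * s_i

# Left and right hand side of the second property in Proposition 2.
lhs = sum(t * kl(s, s_tilde) for t, s in zip(weights, states))
rhs = sum(t * kl(s, s_bar) for t, s in zip(weights, states)) + kl(s_bar, s_tilde)
print(lhs, rhs)  # the two sides agree up to rounding in this example
```

Since the last term on the right is non-negative and vanishes when the reference state equals the average state, the same computation also illustrates the third property.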

If the state space is finite dimensional and there exists
a unique action $a_2$ such that $F(s_2) = E[a_2(s_2)]$ then
$D_F(s_1, s_2) = E[a_1(s_1)] - E[a_2(s_1)]$, where $a_1$ denotes an
optimal action for $s_1$. If unique optimal actions exist for
any state then $F$ is differentiable, which implies that the
regret can be written as a Bregman divergence in the
following form

$$D_F(s_1, s_2) = F(s_1) - \left( F(s_2) + \langle s_1 - s_2, \nabla F(s_2) \rangle \right). \tag{2}$$

In the context of forecasting and statistical scoring rules
the use of Bregman divergences dates back to [9].
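As a minimal sketch of the Bregman form of the regret, the following code evaluates F(s1) - (F(s2) + <s1 - s2, grad F(s2)>) for two convex functions on probability vectors. The function names and the example states are assumptions made for illustration. With F equal to negative entropy the result coincides with information divergence, and with F equal to the squared Euclidean norm it is the squared Euclidean distance.

```python
import math

def bregman(F, grad_F, s1, s2):
    """Bregman divergence F(s1) - (F(s2) + <s1 - s2, grad F(s2)>)."""
    inner = sum((a - b) * g for a, b, g in zip(s1, s2, grad_F(s2)))
    return F(s1) - (F(s2) + inner)

def neg_entropy(s):
    """Convex function s -> sum_x s(x) log s(x)."""
    return sum(x * math.log(x) for x in s if x > 0)

def grad_neg_entropy(s):
    """Gradient of negative entropy; requires strictly positive entries."""
    return [math.log(x) + 1.0 for x in s]

def sq_norm(s):
    return sum(x * x for x in s)

def grad_sq_norm(s):
    return [2.0 * x for x in s]

def kl(p, q):
    """Information divergence D(p||q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical states on a three-letter alphabet.
s1 = [0.5, 0.3, 0.2]
s2 = [0.2, 0.2, 0.6]

d_entropy = bregman(neg_entropy, grad_neg_entropy, s1, s2)
d_square = bregman(sq_norm, grad_sq_norm, s1, s2)
print(d_entropy, kl(s1, s2))  # these two numbers agree up to rounding
print(d_square)               # squared Euclidean distance between s1 and s2
```

Both functions give a valid regret in the sense of this section; which of them is appropriate depends on the set of available actions.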

We note that $D_{F_1}(s_1, s_2) = D_{F_2}(s_1, s_2)$ if and only
if $F_1(s) - F_2(s)$ is an affine function of $s$. If the state $s_2$
has the unique optimal action $a_2$ then

$$F(s_1) = D_F(s_1, s_2) + E[a_2(s_1)] \tag{3}$$

so the function $F$ can be reconstructed from $D_F$ except for
an affine function of $s_1$. The closure of the convex hull of
the set of functions $s \to E[a(s)]$ is uniquely determined
by the convex function $F$.


4. SUFFICIENCY

Let $(s_\theta)_\theta$ denote a family of states and let $\Phi$ denote a
completely positive transformation $\mathcal{S} \to \mathcal{T}$ where $\mathcal{S}$ and $\mathcal{T}$
denote state spaces. Then $\Phi$ is said to be sufficient for
$(s_\theta)_\theta$ if there exists a completely positive transformation
$\Psi : \mathcal{T} \to \mathcal{S}$ such that $\Psi(\Phi(s_\theta)) = s_\theta$.

We say that the regret $D_F$ on the state space $\mathcal{S}$ satisfies
the sufficiency property if $D_F(\Phi(s_1), \Phi(s_2)) = D_F(s_1, s_2)$
for any completely positive transformation $\mathcal{S} \to \mathcal{S}$
that is sufficient for $(s_1, s_2)$. The notion of sufficiency
as a property of divergences was introduced in [?]. The
crucial idea of restricting the attention to transformations
of the state space into itself was introduced in [?].

Theorem 3. Assume that $\mathcal{S}$ is a state space. If the divergence
$D_F$ satisfies the sufficiency property then for any
state $s$ and any completely positive transformation
$\Phi : \mathcal{S} \to \mathcal{S}$ one has $F(\Phi(s)) = F(s)$.

If the alphabet size is two the above condition on $F$ is
sufficient to conclude that

$$D_F(\Phi(s_1), \Phi(s_2)) = D_F(s_1, s_2). \tag{4}$$

Theorem 4. Assume that the state space $\mathcal{S}$ is a classical
or quantum state space on three or more letters. If the
regret $D_F$ satisfies the sufficiency property, then $F$ is
proportional to the entropy function and $D_F$ is proportional
to information divergence (relative entropy).

This theorem can be proved via a number of partial
results as explained in the next section.

5. APPLICATIONS 

5.1. Statistics 

Consider an experiment with $\mathcal{X} = \{1, 2, \dots, \ell\}$ as sample
space. A scoring rule $f$ is defined as a function
$\mathcal{X} \times M_+^1(\mathcal{X}) \to \mathbb{R}$ such that the score is $f(x, Q)$
when the prediction was given by $Q$ and $x \in \mathcal{X}$ has been
observed. A scoring rule is proper if for any probability
measure $P \in M_+^1(\mathcal{X})$ the score $\sum_{x \in \mathcal{X}} P(x) \cdot f(x, Q)$
is minimal when $Q = P$.
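The definition of a proper scoring rule can be explored numerically. The sketch below assumes a three-letter sample space and a coarse grid of candidate predictions, both hypothetical choices. It compares the logarithmic score f(x, Q) = -log Q(x), which is proper, with the linear score f(x, Q) = -Q(x), which is a standard example of a rule that is not proper.

```python
import math

def expected_score(f, P, Q):
    """Expected score sum_x P(x) * f(x, Q) when P is the true distribution."""
    return sum(p * f(x, Q) for x, p in enumerate(P) if p > 0)

def log_score(x, Q):
    """Logarithmic score, written as a loss that should be minimized."""
    return -math.log(Q[x])

def linear_score(x, Q):
    """Linear score as a loss; used here as an example of a non-proper rule."""
    return -Q[x]

def grid(step=0.05):
    """Probability vectors on three letters with positive entries on a grid."""
    n = round(1 / step)
    return [[i / n, j / n, (n - i - j) / n]
            for i in range(1, n) for j in range(1, n - i)]

P = [0.5, 0.3, 0.2]  # hypothetical true distribution (a point of the grid)

best_log = min(grid(), key=lambda Q: expected_score(log_score, P, Q))
best_lin = min(grid(), key=lambda Q: expected_score(linear_score, P, Q))
print(best_log)  # the grid point closest to P
print(best_lin)  # concentrates on the most likely letter instead of P
```

The linear score rewards overstating the most probable outcome, which is why its minimizer moves away from P.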

Theorem 5. The scoring rule $f$ is proper if and only if
there exists a smooth function $F$ such that
$f(x, Q) = D_F(\delta_x, Q) + g(x)$, where $g$ depends on $x$ only.

Definition 6. A strictly local scoring rule is a scoring rule
of the form $f(x, Q) = g(Q(x))$.

Lemma 7. On a finite space a Bregman divergence that
satisfies the sufficiency condition gives a strictly local
scoring rule.

The following theorem was given in [11] with a much
longer proof.

Theorem 8. On a finite alphabet with at least three letters
a Bregman divergence that satisfies the sufficiency condition
is proportional to information divergence.


Proof. Since any strictly local proper scoring rule corresponds
to a separable divergence, a divergence that is Bregman
and satisfies sufficiency must also be separable. If
the alphabet size is at least three the only separable
divergences that are Bregman divergences are the ones
proportional to information divergence [?]. □

5.2. Information theory 

Let $b_1, b_2, \dots, b_n$ denote the letters of an alphabet and let
$\ell(\kappa(b_i))$ denote the length of the codeword $\kappa(b_i)$ according
to some code book $\kappa$. If the code is uniquely decodable
then $\sum_i 2^{-\ell(\kappa(b_i))} \leq 1$. Note that $\ell(\kappa(b_i))$ is an integer.
If only integer values of $\ell$ are allowed then $F$ is piecewise
linear and sufficiency is not fulfilled. If arbitrary real
numbers are allowed then it is obvious that we get a proper
local scoring rule.
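The role of real-valued versus integer code lengths can be illustrated with a small computation. This is a sketch under assumptions: the distributions are hypothetical, and idealized code lengths are taken to be -log2 Q(x) for a distribution Q, so that the Kraft sum equals one. With such lengths the regret, i.e. the expected excess length over the entropy, equals information divergence in bits; rounding the lengths up to integers keeps the Kraft inequality valid but adds further excess length.

```python
import math

def kraft_sum(lengths):
    """Left hand side of Kraft's inequality for binary codeword lengths."""
    return sum(2.0 ** (-length) for length in lengths)

def expected_length(P, lengths):
    return sum(p * length for p, length in zip(P, lengths))

def entropy_bits(P):
    return -sum(p * math.log2(p) for p in P if p > 0)

def kl_bits(P, Q):
    """Information divergence D(P||Q) measured in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

# Hypothetical example: the source follows P but the code is designed for Q.
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]

ideal_lengths = [-math.log2(q) for q in Q]  # real-valued lengths
regret = expected_length(P, ideal_lengths) - entropy_bits(P)
print(regret, kl_bits(P, Q))  # equal up to rounding

# Integer lengths for a non-dyadic distribution R (Shannon code lengths).
R = [0.4, 0.3, 0.2, 0.1]
integer_lengths = [math.ceil(-math.log2(r)) for r in R]
print(kraft_sum(integer_lengths))  # at most 1, so a prefix code exists
print(expected_length(R, integer_lengths) - entropy_bits(R))  # extra cost
```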

5.3. Statistical mechanics 

Statistical mechanics can be stated based on classical mechanics
or quantum mechanics. For our purpose this makes
no difference because Theorem 4 can be applied for both
classical systems and quantum systems.

Proof of Theorem 4. If we restrict to any commutative
sub-algebra the divergence is proportional to information
divergence as stated in Theorem 8, so that $F$ is proportional
to the entropy function $H$ restricted to the sub-algebra.
Any state generates a commutative sub-algebra so the function
$F$ is proportional to $H$ on all states and the divergence
is proportional to information divergence. □

Assume that a heat bath of temperature $T$ is given and
that all the states are close to the state of the heat bath. An
action $a \in \mathcal{A}$ is some interaction with the thermodynamic
system that extracts some energy from the system. In
thermodynamics the quantity $F(s) = \sup_{a \in \mathcal{A}} E[a(s)]$
is normally called the free energy. If the temperature is
kept fixed under all interactions $F$ is called Helmholtz
free energy. Any sufficient transformation $\Phi$ for $s_1$ and
$s_2$ is quasi-static and can be approximately realized by a
physical process $\Psi$ that is reversible in the thermodynamic
sense of the word.

$$D_F(\Phi(s_1), \Phi(s_2)) = a_{\Phi(s_1)}(\Phi(s_1)) - a_{\Phi(s_2)}(\Phi(s_1)). \tag{5}$$

Now

\begin{align}
a_{\Phi(s_2)}(\Phi(s_2)) &= (a_{\Phi(s_2)} \circ \Phi)(s_2) \notag \\
&\leq a_2(s_2) = a_2(\Psi(\Phi(s_2))) \notag \\
&= (a_2 \circ \Psi)(\Phi(s_2)) \leq a_{\Phi(s_2)}(\Phi(s_2)). \tag{6}
\end{align}

Hence $a_{\Phi(s_2)} = a_2 \circ \Psi$ so that

\begin{align}
D_F(\Phi(s_1), \Phi(s_2)) &= (a_1 \circ \Psi)(\Phi(s_1)) - (a_2 \circ \Psi)(\Phi(s_1)) \notag \\
&= a_1(s_1) - a_2(s_1) = D_F(s_1, s_2). \tag{7}
\end{align}


The amount of extractable energy $Ex$ is proportional to
information divergence. The quotient between extractable
energy and information divergence depends on the temperature
and one may even define the absolute temperature
via the formula

$$Ex = kT \cdot D(s_1 \| s_2) \tag{8}$$

where $k = 1.381 \cdot 10^{-23}$ J/K is Boltzmann's constant.
Equation (8) was derived already in [12] by a similar argument.
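To get a feeling for the size of the energy in Equation (8), the following sketch evaluates it for a divergence of one bit. Two assumptions are made here that the text does not state explicitly: the divergence is measured in nats (so that one bit corresponds to log 2), and the temperature of 300 K is a hypothetical room-temperature value.

```python
import math

BOLTZMANN = 1.381e-23  # J/K, the value quoted in the text

def extractable_energy(divergence_nats, temperature_kelvin):
    """Energy Ex = k * T * D in joules, with D measured in nats (assumption)."""
    return BOLTZMANN * temperature_kelvin * divergence_nats

one_bit_in_nats = math.log(2)
energy = extractable_energy(one_bit_in_nats, 300.0)
print(energy)  # roughly 2.87e-21 joules per bit at 300 K
```

The tiny value explains why the conversion between information and energy is not noticed in everyday thermodynamic systems.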

According to Equation (8) any bit of information can
be converted into an amount of energy! One may ask how
this is related to the mixing paradox (a special case of
Gibbs' paradox). Consider a container divided by a wall
with a blue gas on one side of the wall and a yellow gas
on the other. The question is how much energy can be
extracted by mixing the gases.



We lose one bit of information about each molecule
by mixing the gases, but if the color is the only difference
no energy can be extracted. This seems to be in conflict
with Equation (8), but in this case different states cannot
be converted into each other by reversible processes. For
instance one cannot convert the blue gas into the yellow
gas. To get around this problem one can restrict the set of
preparations and one can restrict the set of measurements.
For instance one may simply ignore measurements of the
color of the gas. What should be taken into account and
what should be ignored can only be answered by an
experienced physicist. Formally this solves the mixing paradox,
but from a practical point of view nothing has been
solved. If for instance the molecules in one of the gases
are much larger than the molecules in the other gas then
a semi-permeable membrane can be used to create an osmotic
pressure that can be used to extract some energy. It
is still an open question which differences in properties of
the two gases can be used to extract energy.

5.4. Portfolio theory 

Let $X_1, X_2, \dots, X_k$ denote price relatives for a list of
stocks. For instance $X_5 = 1.04$ means that stock no. 5
increases its value by 4 %. A portfolio is a probability vector
$\vec{b} = (b_1, b_2, \dots, b_k)$ where for instance $b_5 = 0.3$ means
that 30 % of your money is invested in stock no. 5. The
total price relative is $X_1 \cdot b_1 + X_2 \cdot b_2 + \dots + X_k \cdot b_k = \langle \vec{X}, \vec{b} \rangle$.
We now consider a situation where the stocks are traded
once every day. For a sequence of price relative vectors
$\vec{X}_1, \vec{X}_2, \dots, \vec{X}_n$ and a constant re-balancing portfolio $\vec{b}$
the wealth after $n$ days is

$$S_n = \prod_{i=1}^{n} \langle \vec{X}_i, \vec{b} \rangle. \tag{9}$$

According to the law of large numbers

$$\frac{1}{n} \log(S_n) \to E\left[ \log \langle \vec{X}, \vec{b} \rangle \right]. \tag{10}$$
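The convergence of the growth exponent can be illustrated by simulation. The sketch below assumes independent and identically distributed price relative vectors with three hypothetical outcomes for one stock and cash; the outcome values, probabilities, portfolio and function names are illustrative assumptions only. The simulation accumulates the logarithm of the wealth instead of the wealth itself to avoid numerical overflow.

```python
import math
import random

def inner(x, b):
    """Total price relative <x, b> of portfolio b for price relatives x."""
    return sum(xi * bi for xi, bi in zip(x, b))

def expected_log_return(b, outcomes, probs):
    """Exact value of E[log <X, b>] for a finitely supported distribution."""
    return sum(p * math.log(inner(x, b)) for x, p in zip(outcomes, probs))

def simulated_growth_rate(b, outcomes, probs, n, seed=0):
    """(1/n) * log(S_n) for a constant re-balancing portfolio b."""
    rng = random.Random(seed)
    log_wealth = 0.0
    for x in rng.choices(outcomes, weights=probs, k=n):
        log_wealth += math.log(inner(x, b))
    return log_wealth / n

# Hypothetical market: one risky stock and cash (price relative 1).
outcomes = [(1.5, 1.0), (0.6, 1.0), (1.1, 1.0)]
probs = [0.4, 0.3, 0.3]
b = (0.5, 0.5)

exact = expected_log_return(b, outcomes, probs)
simulated = simulated_growth_rate(b, outcomes, probs, n=200_000)
print(exact, simulated)  # the simulated rate is close to the exact value
```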


Here $E\left[ \log \langle \vec{X}, \vec{b} \rangle \right]$ is proportional to the doubling rate
and is denoted $W(\vec{b}, P)$ where $P$ indicates the probability
distribution of $\vec{X}$. Our goal is to maximize $W(\vec{b}, P)$ by
choosing an appropriate portfolio $\vec{b}$.

Let $\vec{b}_P$ denote the portfolio that is optimal for $P$. As
proved in [7],

$$W(\vec{b}_P, P) - W(\vec{b}_Q, P) \leq D(P \| Q). \tag{11}$$
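Inequality (11) can be checked on small examples. The sketch below makes several assumptions: the market has two assets so that the optimal portfolio can be found by a one-dimensional ternary search (a stand-in for a general optimizer), all numbers are hypothetical, and logarithms are natural. A second example uses a horse-race market in the sense of Kelly, where each price relative vector is proportional to a basis vector; there the optimal portfolio is known to be proportional betting, and the bound is attained.

```python
import math

def inner(x, b):
    return sum(xi * bi for xi, bi in zip(x, b))

def W(b, outcomes, probs):
    """Expected log return E[log <X, b>] under the distribution probs."""
    return sum(p * math.log(inner(x, b)) for x, p in zip(outcomes, probs) if p > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_optimal_two_assets(outcomes, probs, iterations=200):
    """Ternary search for t maximizing the concave map t -> W((t, 1 - t), P)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if W((m1, 1 - m1), outcomes, probs) < W((m2, 1 - m2), outcomes, probs):
            lo = m1
        else:
            hi = m2
    t = (lo + hi) / 2
    return (t, 1 - t)

# Hypothetical market with one stock and cash, strictly positive price relatives.
outcomes = [(1.5, 1.0), (0.6, 1.0), (1.1, 1.0)]
P = [0.4, 0.3, 0.3]
Q = [0.2, 0.5, 0.3]
b_P = log_optimal_two_assets(outcomes, P)
b_Q = log_optimal_two_assets(outcomes, Q)
gap = W(b_P, outcomes, P) - W(b_Q, outcomes, P)
divergence = kl(P, Q)
print(gap, divergence)  # the gap stays below the divergence

# Horse race with hypothetical odds: proportional betting (b_P = P) is optimal.
odds = (3.0, 2.0, 4.0)
race = [(odds[0], 0.0, 0.0), (0.0, odds[1], 0.0), (0.0, 0.0, odds[2])]
race_gap = W(P, race, P) - W(Q, race, P)
print(race_gap, divergence)  # equal up to rounding
```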

Theorem 9. The Bregman divergence

$$W(\vec{b}_P, P) - W(\vec{b}_Q, P)$$

satisfies the equation

(12)


about what the sufficient variables are, but if the sufficient 
variables have been specified we have the mathematical 
framework to develop the rest of the theory in a consistent 
manner. 
Appendix

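Proof of Theorem 5

If $f$ is given in terms of a regret function $D_F$ then

\begin{align}
\sum_{x \in \mathcal{X}} P(x) \cdot f(x, Q) &= \sum_{x \in \mathcal{X}} P(x) \cdot \left( D_F(\delta_x, Q) + g(x) \right) \notag \\
&\geq \sum_{x \in \mathcal{X}} P(x) \cdot D_F(\delta_x, P) + D_F(P, Q) + \sum_{x \in \mathcal{X}} P(x) \cdot g(x) \tag{14}
\end{align}

because $P = \sum_{x \in \mathcal{X}} P(x) \cdot \delta_x$. If $F$ is smooth then
$D_F(P, Q) = 0$ if and only if $Q = P$.

Assume that $f$ is proper. Then we may define a divergence by

$$D(P, Q) = \sum_{x \in \mathcal{X}} P(x) \cdot f(x, Q) - \sum_{x \in \mathcal{X}} P(x) \cdot f(x, P). \tag{15}$$

Since $f$ is assumed to be proper, $D(P, Q) \geq 0$ with equality
if and only if $P = Q$. The equality
$\sum_i t_i \cdot D_F(P_i, Q) = \sum_i t_i \cdot D_F(P_i, \bar{P}) + D_F(\bar{P}, Q)$
follows by straightforward calculations. With these two results we
see that $D$ equals a Bregman divergence $D_F$ and that

\begin{align}
D_F(\delta_y, Q) &= \sum_{x \in \mathcal{X}} \delta_y(x) \cdot f(x, Q) - \sum_{x \in \mathcal{X}} \delta_y(x) \cdot f(x, \delta_y) \notag \\
&= f(y, Q) - f(y, \delta_y). \tag{16}
\end{align}

Hence $f(x, Q) = D_F(\delta_x, Q) + f(x, \delta_x)$.

Proof of Lemma 7

Let $D_F$ denote a regret function that satisfies the sufficiency
condition. Then

$$D_F(\delta_i, (q_1, q_2, \dots, q_\ell)) = D_F(\delta_1, (q_i, q_{i+1}, \dots, q_{i-2}, q_{i-1})) \tag{17}$$

where we have made a cyclic permutation of indices. Next
we use the sufficient transformation that projects onto a mixture
of $\delta_1$ and a uniform distribution.

$$D_F(\delta_1, (q_1, q_2, \dots, q_\ell)) = D_F\left( \delta_1, \left( q_1, \frac{1 - q_1}{\ell - 1}, \frac{1 - q_1}{\ell - 1}, \dots, \frac{1 - q_1}{\ell - 1} \right) \right) \tag{18}$$

Note that the projection can be obtained by taking a mixture
of all permutations of the extreme points that leave
the first extreme point unchanged. Hence the scoring rule
is given by the local scoring rule

$$g(p) = D_F\left( \delta_1, \left( p, \frac{1 - p}{\ell - 1}, \dots, \frac{1 - p}{\ell - 1} \right) \right).$$

Proof of Theorem 9

We have

\begin{align}
W(\vec{b}_P, P) - W(\vec{b}_Q, P) &= \int \log \langle \vec{X}, \vec{b}_P \rangle \, \mathrm{d}P\vec{X} - \int \log \langle \vec{X}, \vec{b}_Q \rangle \, \mathrm{d}P\vec{X} \notag \\
&= \int \log \left( \frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} \right) \mathrm{d}P\vec{X} \notag \\
&= \int \log \left( \frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} \cdot \frac{\mathrm{d}Q}{\mathrm{d}P} \right) \mathrm{d}P\vec{X} + D(P \| Q). \tag{19}
\end{align}

Next we use Jensen's inequality to get

\begin{align}
W(\vec{b}_P, P) - W(\vec{b}_Q, P) &\leq \log \left( \int \frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} \cdot \frac{\mathrm{d}Q}{\mathrm{d}P} \, \mathrm{d}P\vec{X} \right) + D(P \| Q) \notag \\
&= \log \left( \int \frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} \, \mathrm{d}Q\vec{X} \right) + D(P \| Q) \notag \\
&\leq \log(1) + D(P \| Q) = D(P \| Q). \tag{20}
\end{align}

Jensen's inequality holds with equality if and only if

$$\frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} \cdot \frac{\mathrm{d}Q}{\mathrm{d}P} \tag{21}$$

is constant $P$-almost surely. Equivalently $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is proportional
to $\frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle}$ for any probability measure $Q$ on the
support of $\vec{X}$. The set of vectors $\vec{b}_Q$ lie in a $k - 1$ dimensional
convex set. Therefore the set of probability measures on
the support of $P$ is at most $k - 1$ dimensional. Hence $P$
is supported on at most $k$ vectors.

The inequality

$$\int \frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} \, \mathrm{d}Q\vec{X} \leq 1 \tag{22}$$

holds with equality if $(\vec{b}_Q)_i = 0$ implies $(\vec{b}_P)_i = 0$. If
$\frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} = k \cdot \frac{\mathrm{d}P}{\mathrm{d}Q}$ we have

$$\int \frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle} \, \mathrm{d}Q\vec{X} = \int k \cdot \frac{\mathrm{d}P}{\mathrm{d}Q} \, \mathrm{d}Q\vec{X} = k \tag{23}$$

so $k \leq 1$. The reversed inequality is proved in the
same way so we get

$$\frac{\mathrm{d}P}{\mathrm{d}Q} = \frac{\langle \vec{X}, \vec{b}_P \rangle}{\langle \vec{X}, \vec{b}_Q \rangle}. \tag{24}$$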

Equation (24) gives an affine bijection between distributions
$P$ and portfolios $\vec{b}_P$. The set of portfolios is a simplex
with $k$ extreme points so the set of distributions must
also be a simplex with $k$ extreme points. Therefore the support
of the probability measures is a set of $k$ vectors. We
denote the vector of price relatives corresponding to the
portfolio $\delta_j$ by $\vec{x}_j$ and the $i$'th coordinate of this vector by $(\vec{x}_j)_i$.

The portfolio $\vec{b} = (b_1, b_2, \dots, b_k)$ is optimal for the
probability distribution with weight $b_j$ on the vector $\vec{x}_j$.
According to the Kuhn-Tucker conditions [7, Thm. 15.2.1]
the vector $\vec{b} = (b_1, b_2, \dots, b_k)$ is optimal for the probability
distribution with weight $b_j$ on the vector $\vec{x}_j$ if

$$\sum_{j=1}^{k} b_j \cdot \frac{(\vec{x}_j)_i}{\langle \vec{b}, \vec{x}_j \rangle} \leq 1 \tag{25}$$

with equality for all $i$ for which $b_i > 0$. Assume that
$b_j = \delta_\ell(j)$. Then we get the inequality

$$\frac{(\vec{x}_\ell)_i}{(\vec{x}_\ell)_\ell} \leq 1 \tag{26}$$

or, equivalently, $(\vec{x}_\ell)_i \leq (\vec{x}_\ell)_\ell$.
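The Kuhn-Tucker condition used here, namely that the sum over j of b_j (x_j)_i / <b, x_j> is at most 1 for every i with equality when b_i > 0, is easy to evaluate numerically. The sketch below uses hypothetical price relative vectors and a hypothetical helper name. It shows that vectors proportional to the basis vectors satisfy the condition with equality for every portfolio, whereas for a pair of non-orthogonal vectors a generic portfolio violates it.

```python
def inner(b, x):
    return sum(bi * xi for bi, xi in zip(b, x))

def kuhn_tucker_sums(b, vectors):
    """For each stock i, the sum over j of b_j * (x_j)_i / <b, x_j>.

    Terms with b_j == 0 carry zero weight and are skipped, which also avoids
    dividing by <b, x_j> = 0 when x_j is proportional to the j'th basis vector.
    """
    sums = []
    for i in range(len(b)):
        total = 0.0
        for b_j, x_j in zip(b, vectors):
            if b_j > 0:
                total += b_j * x_j[i] / inner(b, x_j)
        sums.append(total)
    return sums

# Hypothetical price relative vectors proportional to the basis vectors.
race_vectors = [(3.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 4.0)]
race_sums = kuhn_tucker_sums((0.5, 0.3, 0.2), race_vectors)
print(race_sums)   # every entry equals 1 up to rounding

# Hypothetical vectors that are neither orthogonal nor parallel.
mixed_vectors = [(2.0, 1.0), (1.0, 2.0)]
mixed_sums = kuhn_tucker_sums((0.7, 0.3), mixed_vectors)
print(mixed_sums)  # one entry exceeds 1, so the condition fails
```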

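If $b_\ell = s > 0$ and $b_m = t > 0$ where $s + t = 1$ then

\begin{align}
1 &= b_\ell \cdot \frac{(\vec{x}_\ell)_\ell}{\langle \vec{b}, \vec{x}_\ell \rangle} + b_m \cdot \frac{(\vec{x}_m)_\ell}{\langle \vec{b}, \vec{x}_m \rangle} \tag{27} \\
&= \frac{(\vec{x}_\ell)_\ell \cdot b_\ell}{b_\ell \cdot (\vec{x}_\ell)_\ell + b_m \cdot (\vec{x}_\ell)_m} \tag{28} \\
&\quad + \frac{(\vec{x}_m)_\ell \cdot b_m}{b_\ell \cdot (\vec{x}_m)_\ell + b_m \cdot (\vec{x}_m)_m}. \tag{29}
\end{align}

Hence

$$\frac{b_m \cdot (\vec{x}_\ell)_m}{b_\ell \cdot (\vec{x}_\ell)_\ell + b_m \cdot (\vec{x}_\ell)_m} = \frac{(\vec{x}_m)_\ell \cdot b_m}{b_\ell \cdot (\vec{x}_m)_\ell + b_m \cdot (\vec{x}_m)_m} \tag{30}$$

and

$$\frac{(\vec{x}_\ell)_m}{b_\ell \cdot (\vec{x}_\ell)_\ell + b_m \cdot (\vec{x}_\ell)_m} = \frac{(\vec{x}_m)_\ell}{b_\ell \cdot (\vec{x}_m)_\ell + b_m \cdot (\vec{x}_m)_m} \tag{31}$$

which is equivalent to

$$(\vec{x}_\ell)_m \left( b_\ell \cdot (\vec{x}_m)_\ell + b_m \cdot (\vec{x}_m)_m \right) = (\vec{x}_m)_\ell \left( b_\ell \cdot (\vec{x}_\ell)_\ell + b_m \cdot (\vec{x}_\ell)_m \right). \tag{32}$$

This should hold for all positive $b_\ell, b_m$ for which
$b_\ell + b_m = 1$ so it also holds for the limiting value $b_m = 0$
where the equality reduces to

$$(\vec{x}_\ell)_m (\vec{x}_m)_\ell = (\vec{x}_m)_\ell (\vec{x}_\ell)_\ell \tag{33}$$

so that either $(\vec{x}_m)_\ell = 0$ or $(\vec{x}_\ell)_m = (\vec{x}_\ell)_\ell$. Similarly
we get $(\vec{x}_\ell)_m = 0$ or $(\vec{x}_m)_\ell = (\vec{x}_m)_m$. Together we get
either $(\vec{x}_m)_\ell = 0$ and $(\vec{x}_\ell)_m = 0$, or $(\vec{x}_\ell)_m = (\vec{x}_\ell)_\ell$
and $(\vec{x}_m)_\ell = (\vec{x}_m)_m$. Therefore $(\vec{x}_m)_\ell = 0$ or
$(\vec{x}_m)_\ell = (\vec{x}_m)_m$.

Let $\sim$ denote the relation on $\{1, 2, 3, \dots, k\}$ defined
by $\ell \sim m$ when $(\vec{x}_m)_\ell = (\vec{x}_m)_m$. The relation $\sim$ is
obviously reflexive, and as we have seen it is symmetric.
We will prove that $\sim$ is transitive. Assume that $\ell \sim m$
and $m \sim n$. Then $(\vec{x}_\ell)_\ell = (\vec{x}_\ell)_m$ and
$(\vec{x}_m)_\ell = (\vec{x}_m)_m = (\vec{x}_m)_n$ and $(\vec{x}_n)_m = (\vec{x}_n)_n$. Assume further that
$\ell \not\sim n$ so that $(\vec{x}_\ell)_n = (\vec{x}_n)_\ell = 0$. Assume that $b_\ell = s > 0$,
$b_m = t > 0$, and $b_n = u > 0$, and $s + t + u = 1$. Then

\begin{align}
1 &= \frac{(\vec{x}_\ell)_n \cdot s}{s (\vec{x}_\ell)_\ell + t (\vec{x}_\ell)_m + u (\vec{x}_\ell)_n} \tag{34} \\
&\quad + \frac{(\vec{x}_m)_n \cdot t}{s (\vec{x}_m)_\ell + t (\vec{x}_m)_m + u (\vec{x}_m)_n} \tag{35} \\
&\quad + \frac{(\vec{x}_n)_n \cdot u}{s (\vec{x}_n)_\ell + t (\vec{x}_n)_m + u (\vec{x}_n)_n} \tag{36} \\
1 &= \frac{0}{s (\vec{x}_\ell)_\ell + t (\vec{x}_\ell)_\ell} \tag{37} \\
&\quad + \frac{(\vec{x}_m)_m \cdot t}{s (\vec{x}_m)_m + t (\vec{x}_m)_m + u (\vec{x}_m)_m} \tag{38} \\
&\quad + \frac{(\vec{x}_n)_n \cdot u}{t (\vec{x}_n)_n + u (\vec{x}_n)_n} \tag{39} \\
1 &= t + \frac{u}{t + u}. \tag{40}
\end{align}

This should hold for all $s, t, u$, which is a contradiction.
Therefore $\ell \sim n$ and we conclude that $\sim$ is transitive.

Since $\sim$ is transitive, either $\vec{x}_\ell$ and $\vec{x}_m$ are orthogonal,
or they are parallel with price relatives that are either
zero or the same for stock $\ell$ and stock $m$. In the latter
case we may consider stock $\ell$ and stock $m$ as the same stock.
Hence we may exclude the case where vectors are parallel,
so all the vectors are orthogonal, but this is only possible
if the vectors are proportional to the basis vectors.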